Setting global options and initialization of the seed variable, which is necessary for ensuring reproducibility.
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)
set.seed(23)
Loading the libraries used in the project.
library(knitr)
library(xlsx)
library(lubridate)
library(dplyr)
library(tidyr)
library(arules)
library(DT)
library(kableExtra)
library(plotly)
library(ggplot2)
library(corrplot)
library(codebook)
The dataset is based on the medical records of 375 patients suffering from COVID-19 collected between 10 January and 18 February 2020 in Tongji Hospital in China. The medical records were gathered by using standard case report forms, which included clinical, epidemiological, demographic, laboratory and mortality outcome information. Data collecting methods and some basic statistics are presented in Tan et al article published in Nature, 14 May 2020.
This section covers loading source data, initial remarks about its content, cleaning phase and detailed dataset description.
The dataset can be downloaded under this link. In this project it’s stored in a local subdirectory to ensure availability, even when it’s deleted from the original source.
Loading source data into a dataframe without any pre clean-up with caching option checked.
path <- '../res/wuhan_blood_sample_data_Jan_Feb_2020.xlsx'
df_raw <- read.xlsx(path, sheetIndex=1, header=TRUE)
| rows | columns | unique RE_DATE | NAs in RE_DATE | NAs in PATIENT_ID |
|---|---|---|---|---|
| 6120 | 81 | 3092 | 14 | 5745 |
In order to get a distinctive key for each line we will combine RE_DATE with PATIENT_ID values. In the first step we will remove rows with NA in RE_DATE, because after further checking we can see that all of the blood test results columns are empty for these lines. Then we will fill the PATIENT_ID column.
all(sapply(df_raw[is.na(df_raw$RE_DATE), c(8:81)], is.na))
## [1] TRUE
df <- df_raw %>%
filter(!is.na(RE_DATE)) %>%
fill(PATIENT_ID)
nrow(df)
## [1] 6106
length(unique(paste(df$PATIENT_ID, df$RE_DATE)))
## [1] 6106
The columns can be divided into 2 categories:
Short notes about the first 7 columns, containing information about a patient, are shown below.
## 'data.frame': 6106 obs. of 7 variables:
## $ PATIENT_ID : num 1 1 1 1 1 1 1 1 1 1 ...
## $ RE_DATE : POSIXct, format: "2020-01-31 01:09:00" "2020-01-31 01:25:00" ...
## $ age : num 73 73 73 73 73 73 73 73 73 73 ...
## $ gender : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Admission.time: POSIXct, format: "2020-01-30 22:12:47" "2020-01-30 22:12:47" ...
## $ Discharge.time: POSIXct, format: "2020-02-17 12:40:08" "2020-02-17 12:40:08" ...
## $ outcome : num 0 0 0 0 0 0 0 0 0 0 ...
As we can see, the gender and outcome of treatment columns are encoded with numbers. Both of these attributes have only 2 unique values, so we will transform them into factors for more clarity. The information about encoding was found in the article mentioned before, which was used as a reference in this analysis.
df <- df %>%
mutate(gender = factor(gender, labels = c('male', 'female')),
outcome = factor(outcome, labels = c('survived', 'deceased')))
The next table presents a summary for the remaining columns, which store results of specific blood tests.
| Hypersensitive.cardiac.troponinI | hemoglobin | Serum.chloride | Prothrombin.time | procalcitonin | eosinophils… | Interleukin.2.receptor | Alkaline.phosphatase | albumin | basophil… | Interleukin.10 | Total.bilirubin | Platelet.count | monocytes… | antithrombin | Interleukin.8 | indirect.bilirubin | Red.blood.cell.distribution.width. | neutrophils… | total.protein | Quantification.of.Treponema.pallidum.antibodies | Prothrombin.activity | HBsAg | mean.corpuscular.volume | hematocrit | White.blood.cell.count | Tumor.necrosis.factorα | mean.corpuscular.hemoglobin.concentration | fibrinogen | Interleukin.1β | Urea | lymphocyte.count | PH.value | Red.blood.cell.count | Eosinophil.count | Corrected.calcium | Serum.potassium | glucose | neutrophils.count | Direct.bilirubin | Mean.platelet.volume | ferritin | RBC.distribution.width.SD | Thrombin.time | X…lymphocyte | HCV.antibody.quantification | D.D.dimer | Total.cholesterol | aspartate.aminotransferase | Uric.acid | HCO3. | calcium | Amino.terminal.brain.natriuretic.peptide.precursor.NT.proBNP. | Lactate.dehydrogenase | platelet.large.cell.ratio. | Interleukin.6 | Fibrin.degradation.products | monocytes.count | PLT.distribution.width | globulin | γ.glutamyl.transpeptidase | International.standard.ratio | basophil.count… | X2019.nCoV.nucleic.acid.detection | mean.corpuscular.hemoglobin. | Activation.of.partial.thromboplastin.time | High.sensitivity.C.reactive.protein | HIV.antibody.quantification | serum.sodium | thrombocytocrit | ESR | glutamic.pyruvic.transaminase | eGFR | creatinine | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.9 | Min. : 6.4 | Min. : 71.50 | Min. : 11.50 | Min. : 0.020 | Min. :0.000 | Min. : 61.0 | Min. : 17.00 | Min. :13.60 | Min. :0.00 | Min. : 5.00 | Min. : 2.50 | Min. : -1.0 | Min. : 0.300 | Min. : 20.00 | Min. : 5.000 | Min. : 0.100 | Min. :10.60 | Min. : 1.7 | Min. :31.80 | Min. : 0.020 | Min. : 6.00 | Min. : 0.000 | Min. : 61.60 | Min. :14.50 | Min. : 0.13 | Min. : 4.00 | Min. :286.0 | Min. : 0.500 | Min. : 5.00 | Min. : 0.800 | Min. : 0.000 | Min. :5.000 | Min. : 0.100 | Min. :0.000 | Min. :1.650 | Min. : 2.760 | Min. : 1.000 | Min. : 0.06 | Min. : 1.600 | Min. : 8.50 | Min. : 17.8 | Min. : 31.30 | Min. : 13.00 | Min. : 0.000 | Min. :0.020 | Min. : 0.210 | Min. :0.100 | Min. : 6.00 | Min. : 43.0 | Min. : 6.30 | Min. :1.170 | Min. : 5 | Min. : 110.0 | Min. :11.20 | Min. : 1.500 | Min. : 4.00 | Min. : 0.010 | Min. : 8.00 | Min. :10.10 | Min. : 3.00 | Min. : 0.840 | Min. :0.000 | Min. :-1 | Min. :20.4 | Min. : 21.80 | Min. : 0.10 | Min. :0.05 | Min. :115.4 | Min. :0.010 | Min. : 1.00 | Min. : 5.00 | Min. : 2.00 | Min. : 11.00 | |
| 1st Qu.: 4.4 | 1st Qu.:113.0 | 1st Qu.: 99.05 | 1st Qu.: 13.60 | 1st Qu.: 0.040 | 1st Qu.:0.000 | 1st Qu.: 459.5 | 1st Qu.: 54.00 | 1st Qu.:27.40 | 1st Qu.:0.10 | 1st Qu.: 5.00 | 1st Qu.: 7.40 | 1st Qu.:109.0 | 1st Qu.: 2.800 | 1st Qu.: 74.00 | 1st Qu.: 8.675 | 1st Qu.: 3.800 | 1st Qu.:12.00 | 1st Qu.:65.1 | 1st Qu.:61.00 | 1st Qu.: 0.040 | 1st Qu.: 65.00 | 1st Qu.: 0.000 | 1st Qu.: 86.90 | 1st Qu.:33.50 | 1st Qu.: 4.94 | 1st Qu.: 6.70 | 1st Qu.:333.0 | 1st Qu.: 3.050 | 1st Qu.: 5.00 | 1st Qu.: 4.000 | 1st Qu.: 0.460 | 1st Qu.:6.000 | 1st Qu.: 3.680 | 1st Qu.:0.000 | 1st Qu.:2.270 | 1st Qu.: 3.950 | 1st Qu.: 5.550 | 1st Qu.: 3.09 | 1st Qu.: 3.225 | 1st Qu.:10.10 | 1st Qu.: 377.2 | 1st Qu.: 38.50 | 1st Qu.: 15.60 | 1st Qu.: 3.925 | 1st Qu.:0.040 | 1st Qu.: 0.602 | 1st Qu.:3.010 | 1st Qu.: 19.50 | 1st Qu.: 183.2 | 1st Qu.:21.00 | 1st Qu.:1.980 | 1st Qu.: 150 | 1st Qu.: 218.0 | 1st Qu.:25.60 | 1st Qu.: 4.772 | 1st Qu.: 4.00 | 1st Qu.: 0.270 | 1st Qu.:11.10 | 1st Qu.:29.70 | 1st Qu.: 22.00 | 1st Qu.: 1.030 | 1st Qu.:0.010 | 1st Qu.:-1 | 1st Qu.:29.7 | 1st Qu.: 35.30 | 1st Qu.: 5.70 | 1st Qu.:0.07 | 1st Qu.:137.7 | 1st Qu.:0.150 | 1st Qu.: 14.00 | 1st Qu.: 16.00 | 1st Qu.: 63.58 | 1st Qu.: 58.00 | |
| Median : 20.6 | Median :125.0 | Median :102.10 | Median : 14.80 | Median : 0.100 | Median :0.100 | Median : 676.5 | Median : 69.50 | Median :32.20 | Median :0.20 | Median : 5.90 | Median : 10.70 | Median :178.0 | Median : 5.700 | Median : 86.00 | Median : 16.000 | Median : 5.400 | Median :12.60 | Median :82.4 | Median :65.90 | Median : 0.050 | Median : 81.00 | Median : 0.010 | Median : 90.10 | Median :36.60 | Median : 7.72 | Median : 8.60 | Median :343.0 | Median : 4.120 | Median : 5.00 | Median : 5.985 | Median : 0.800 | Median :6.500 | Median : 4.140 | Median :0.010 | Median :2.360 | Median : 4.410 | Median : 6.990 | Median : 5.85 | Median : 4.800 | Median :10.80 | Median : 711.0 | Median : 40.90 | Median : 16.80 | Median :11.450 | Median :0.060 | Median : 2.155 | Median :3.630 | Median : 27.00 | Median : 243.7 | Median :23.50 | Median :2.080 | Median : 585 | Median : 340.0 | Median :30.90 | Median : 19.265 | Median : 17.90 | Median : 0.410 | Median :12.40 | Median :32.70 | Median : 34.00 | Median : 1.140 | Median :0.010 | Median :-1 | Median :30.9 | Median : 39.20 | Median : 51.50 | Median :0.09 | Median :140.4 | Median :0.210 | Median : 28.00 | Median : 24.00 | Median : 87.90 | Median : 76.00 | |
| Mean : 1223.2 | Mean :123.1 | Mean :103.14 | Mean : 16.68 | Mean : 1.107 | Mean :0.629 | Mean : 907.2 | Mean : 82.47 | Mean :32.01 | Mean :0.21 | Mean : 16.07 | Mean : 16.70 | Mean :184.3 | Mean : 6.155 | Mean : 85.32 | Mean : 83.088 | Mean : 6.889 | Mean :13.07 | Mean :77.6 | Mean :65.30 | Mean : 0.132 | Mean : 78.55 | Mean : 8.306 | Mean : 90.39 | Mean :36.55 | Mean : 15.60 | Mean : 11.58 | Mean :342.8 | Mean : 4.294 | Mean : 6.51 | Mean : 9.589 | Mean : 1.017 | Mean :6.484 | Mean : 9.288 | Mean :0.039 | Mean :2.355 | Mean : 4.509 | Mean : 8.889 | Mean : 7.81 | Mean : 9.887 | Mean :10.91 | Mean : 1379.1 | Mean : 42.44 | Mean : 18.17 | Mean :15.392 | Mean :0.117 | Mean : 7.943 | Mean :3.689 | Mean : 46.53 | Mean : 276.1 | Mean :23.14 | Mean :2.078 | Mean : 3669 | Mean : 474.2 | Mean :31.77 | Mean : 112.308 | Mean : 61.35 | Mean : 0.526 | Mean :13.01 | Mean :33.24 | Mean : 55.34 | Mean : 1.313 | Mean :0.017 | Mean :-1 | Mean :31.0 | Mean : 41.52 | Mean : 76.24 | Mean :0.10 | Mean :141.6 | Mean :0.212 | Mean : 33.69 | Mean : 38.86 | Mean : 81.56 | Mean : 109.93 | |
| 3rd Qu.: 223.8 | 3rd Qu.:137.0 | 3rd Qu.:105.65 | 3rd Qu.: 16.70 | 3rd Qu.: 0.405 | 3rd Qu.:0.800 | 3rd Qu.:1155.5 | 3rd Qu.: 95.00 | 3rd Qu.:36.60 | 3rd Qu.:0.30 | 3rd Qu.: 12.35 | 3rd Qu.: 16.77 | 3rd Qu.:248.0 | 3rd Qu.: 8.600 | 3rd Qu.: 97.00 | 3rd Qu.: 35.200 | 3rd Qu.: 8.000 | 3rd Qu.:13.70 | 3rd Qu.:92.3 | 3rd Qu.:70.45 | 3rd Qu.: 0.070 | 3rd Qu.: 95.00 | 3rd Qu.: 0.010 | 3rd Qu.: 93.90 | 3rd Qu.:39.90 | 3rd Qu.: 12.72 | 3rd Qu.: 11.50 | 3rd Qu.:350.0 | 3rd Qu.: 5.480 | 3rd Qu.: 5.00 | 3rd Qu.:11.400 | 3rd Qu.: 1.310 | 3rd Qu.:7.294 | 3rd Qu.: 4.650 | 3rd Qu.:0.060 | 3rd Qu.:2.440 | 3rd Qu.: 4.870 | 3rd Qu.:10.260 | 3rd Qu.:10.95 | 3rd Qu.: 8.275 | 3rd Qu.:11.50 | 3rd Qu.: 1425.2 | 3rd Qu.: 44.70 | 3rd Qu.: 18.38 | 3rd Qu.:24.975 | 3rd Qu.:0.090 | 3rd Qu.:21.000 | 3rd Qu.:4.265 | 3rd Qu.: 42.00 | 3rd Qu.: 333.8 | 3rd Qu.:25.90 | 3rd Qu.:2.190 | 3rd Qu.: 2625 | 3rd Qu.: 601.8 | 3rd Qu.:37.20 | 3rd Qu.: 60.167 | 3rd Qu.:150.00 | 3rd Qu.: 0.580 | 3rd Qu.:14.30 | 3rd Qu.:36.50 | 3rd Qu.: 58.00 | 3rd Qu.: 1.330 | 3rd Qu.:0.020 | 3rd Qu.:-1 | 3rd Qu.:32.2 | 3rd Qu.: 44.12 | 3rd Qu.:118.50 | 3rd Qu.:0.11 | 3rd Qu.:143.5 | 3rd Qu.:0.270 | 3rd Qu.: 45.50 | 3rd Qu.: 41.00 | 3rd Qu.:103.97 | 3rd Qu.: 98.25 | |
| Max. :50000.0 | Max. :178.0 | Max. :140.40 | Max. :120.00 | Max. :57.170 | Max. :8.600 | Max. :7500.0 | Max. :620.00 | Max. :48.60 | Max. :1.70 | Max. :1000.00 | Max. :505.70 | Max. :558.0 | Max. :53.000 | Max. :136.00 | Max. :6795.000 | Max. :145.100 | Max. :27.10 | Max. :98.9 | Max. :88.70 | Max. :11.950 | Max. :142.00 | Max. :250.000 | Max. :118.90 | Max. :52.30 | Max. :1726.60 | Max. :168.00 | Max. :514.0 | Max. :10.780 | Max. :88.50 | Max. :68.400 | Max. :52.420 | Max. :7.565 | Max. :749.500 | Max. :0.490 | Max. :2.790 | Max. :12.800 | Max. :43.010 | Max. :33.88 | Max. :360.600 | Max. :15.00 | Max. :50000.0 | Max. :113.30 | Max. :161.90 | Max. :60.000 | Max. :2.090 | Max. :60.000 | Max. :7.300 | Max. :1858.00 | Max. :1176.0 | Max. :36.30 | Max. :2.620 | Max. :70000 | Max. :1867.0 | Max. :62.20 | Max. :5000.000 | Max. :190.80 | Max. :39.920 | Max. :25.30 | Max. :50.60 | Max. :732.00 | Max. :13.480 | Max. :0.120 | Max. :-1 | Max. :50.8 | Max. :144.00 | Max. :320.00 | Max. :0.27 | Max. :179.7 | Max. :0.510 | Max. :110.00 | Max. :1600.00 | Max. :224.00 | Max. :1497.00 | |
| NA’s :5599 | NA’s :5131 | NA’s :5131 | NA’s :5444 | NA’s :5647 | NA’s :5149 | NA’s :5838 | NA’s :5176 | NA’s :5172 | NA’s :5149 | NA’s :5839 | NA’s :5176 | NA’s :5149 | NA’s :5148 | NA’s :5776 | NA’s :5838 | NA’s :5200 | NA’s :5183 | NA’s :5149 | NA’s :5175 | NA’s :5827 | NA’s :5447 | NA’s :5827 | NA’s :5149 | NA’s :5149 | NA’s :4979 | NA’s :5838 | NA’s :5149 | NA’s :5540 | NA’s :5838 | NA’s :5170 | NA’s :5149 | NA’s :5722 | NA’s :4979 | NA’s :5149 | NA’s :5192 | NA’s :5126 | NA’s :5331 | NA’s :5149 | NA’s :5176 | NA’s :5244 | NA’s :5823 | NA’s :5183 | NA’s :5540 | NA’s :5148 | NA’s :5827 | NA’s :5476 | NA’s :5175 | NA’s :5171 | NA’s :5172 | NA’s :5172 | NA’s :5127 | NA’s :5631 | NA’s :5172 | NA’s :5244 | NA’s :5834 | NA’s :5776 | NA’s :5149 | NA’s :5244 | NA’s :5176 | NA’s :5176 | NA’s :5447 | NA’s :5149 | NA’s :5605 | NA’s :5149 | NA’s :5538 | NA’s :5369 | NA’s :5828 | NA’s :5131 | NA’s :5244 | NA’s :5723 | NA’s :5175 | NA’s :5170 | NA’s :5170 |
The author of the original dataset decided to label columns with full-length names of the blood tests, which are too long and incomprehensible for our purposes. Also in some of names occur special symbols, which were degenerated during data loading. To obtain easy to work with column names the following steps were performed:
df <- df %>%
rename("hs.cTnI" = "Hypersensitive.cardiac.troponinI") %>%
rename("PT" = "Prothrombin.time") %>%
rename("IL.2.receptor" = "Interleukin.2.receptor") %>%
rename("ALP" = "Alkaline.phosphatase") %>%
rename("IL.10" = "Interleukin.10") %>%
rename("AT" = "antithrombin") %>%
rename("IL.8" = "Interleukin.8") %>%
rename("RDW" = "Red.blood.cell.distribution.width.") %>%
rename("q.anti.TP" = "Quantification.of.Treponema.pallidum.antibodies") %>%
rename("MCV" = "mean.corpuscular.volume") %>%
rename("WBC.count" = "White.blood.cell.count") %>%
rename("TNF.alpha" = "Tumor.necrosis.factorα") %>%
rename("MCHC" = "mean.corpuscular.hemoglobin.concentration") %>%
rename("IL.1.beta" = "Interleukin.1β") %>%
rename("RBC.count" = "Red.blood.cell.count") %>%
rename("MPV" = "Mean.platelet.volume") %>%
rename("RBC.DW.SD" = "RBC.distribution.width.SD") %>%
rename("TT" = "Thrombin.time") %>%
rename("q.anti.HCV" = "HCV.antibody.quantification") %>%
rename("D.Dimer" = "D.D.dimer") %>%
rename("AST" = "aspartate.aminotransferase") %>%
rename("NT.proBNP" = "Amino.terminal.brain.natriuretic.peptide.precursor.NT.proBNP.") %>%
rename("LDH" = "Lactate.dehydrogenase") %>%
rename("P.LCR" = "platelet.large.cell.ratio.") %>%
rename("IL.6" = "Interleukin.6") %>%
rename("FDP" = "Fibrin.degradation.products") %>%
rename("PDW" = "PLT.distribution.width") %>%
rename("GGT" = "γ.glutamyl.transpeptidase") %>%
rename("INR" = "International.standard.ratio") %>%
rename("X2019.nCoV.NAT" = "X2019.nCoV.nucleic.acid.detection") %>%
rename("MCH" = "mean.corpuscular.hemoglobin.") %>%
rename("aPTT" = "Activation.of.partial.thromboplastin.time") %>%
rename("hs.CRP" = "High.sensitivity.C.reactive.protein") %>%
rename("q.anti.HIV" = "HIV.antibody.quantification") %>%
rename("ALAT" = "glutamic.pyruvic.transaminase") %>%
rename_with(~gsub('_', '.', .)) %>%
rename_with(~gsub('\\.{3}$', '.percent', .)) %>%
rename_with(~gsub('^X\\.{3}|\\.$', '', .)) %>%
rename_with(tolower, matches("^[[:upper:]]{1}[[:lower:]]+|^[[:upper:]]{2,}\\.{1}[[:upper:]]{2,}$", ignore.case = FALSE))
Obtained names are presented below.
## [1] "patient.id" "re.date" "age"
## [4] "gender" "admission.time" "discharge.time"
## [7] "outcome" "hs.cTnI" "hemoglobin"
## [10] "serum.chloride" "PT" "procalcitonin"
## [13] "eosinophils.percent" "IL.2.receptor" "ALP"
## [16] "albumin" "basophil.percent" "IL.10"
## [19] "total.bilirubin" "platelet.count" "monocytes.percent"
## [22] "AT" "IL.8" "indirect.bilirubin"
## [25] "RDW" "neutrophils.percent" "total.protein"
## [28] "q.anti.TP" "prothrombin.activity" "HBsAg"
## [31] "MCV" "hematocrit" "WBC.count"
## [34] "TNF.alpha" "MCHC" "fibrinogen"
## [37] "IL.1.beta" "urea" "lymphocyte.count"
## [40] "PH.value" "RBC.count" "eosinophil.count"
## [43] "corrected.calcium" "serum.potassium" "glucose"
## [46] "neutrophils.count" "direct.bilirubin" "MPV"
## [49] "ferritin" "RBC.DW.SD" "TT"
## [52] "lymphocyte" "q.anti.HCV" "D.Dimer"
## [55] "total.cholesterol" "AST" "uric.acid"
## [58] "HCO3" "calcium" "NT.proBNP"
## [61] "LDH" "P.LCR" "IL.6"
## [64] "FDP" "monocytes.count" "PDW"
## [67] "globulin" "GGT" "INR"
## [70] "basophil.count.percent" "X2019.nCoV.NAT" "MCH"
## [73] "aPTT" "hs.CRP" "q.anti.HIV"
## [76] "serum.sodium" "thrombocytocrit" "ESR"
## [79] "ALAT" "eGFR" "creatinine"
All of these attributes are of numeric type. In the summary we could observe that in one column some values are equal to -1. According to the article, this number was selected as a label for missing values. After checking the whole content of that column, we can see that there are only two different values {NA, -1}, which indicates that it’s filled only with missing values, thus should be removed from the dataset.
## [1] "X2019.nCoV.NAT"
## [1] NA -1
df <- df %>% select(-c(colnames(df)[apply(df, 2, function(x) any(x == -1, na.rm = TRUE))]))
The number of lines per patient is clearly inconstant, which suggests that every patient could have undergone various combinations of specific blood tests. Also, it’s common knowledge that one sample of drawn blood is enough to perform multiple tests. We would like to check if some of the blood tests available in our dataset were more likely to be performed together.
Firstly, let’s check how many unique blood tests combinations are present in the dataset.
trans <- unique(apply(!is.na(df[, 8:80]), 1, which))
length(trans)
## [1] 175
Now we will try to find closed frequent itemsets with apriori algorithm. The maximum number of items per itemset is set to 25. The maximum subset checking time is 60 seconds, support = 0.09 and confidence = 0.8.
freq_itemsets <- apriori(trans, parameter = list(support=0.09, maxlen=25, maxtime=60, target="closed frequent itemsets"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 60 0.09 1
## maxlen target ext
## 25 closed frequent itemsets TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 15
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[73 item(s), 175 transaction(s)] done [0.00s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 done [29.94s].
## filtering closed item sets ... done [19.79s].
## sorting transactions ... done [0.00s].
## writing ... [162 set(s)] done [0.34s].
## creating S4 object ... done [0.51s].
max_freq_itemsets <- freq_itemsets[is.maximal(freq_itemsets)]
itemLabels(max_freq_itemsets) <- colnames(df[, 8:80])
inspect(sort(max_freq_itemsets, by='support'))
## items support transIdenticalToItemsets count
## [1] {hs.cTnI} 0.18285714 0.000000000 32
## [2] {procalcitonin,
## NT.proBNP} 0.14857143 0.000000000 26
## [3] {q.anti.TP,
## HBsAg,
## q.anti.HCV,
## q.anti.HIV} 0.11428571 0.000000000 20
## [4] {serum.chloride,
## ALP,
## albumin,
## total.bilirubin,
## indirect.bilirubin,
## total.protein,
## urea,
## corrected.calcium,
## serum.potassium,
## glucose,
## direct.bilirubin,
## total.cholesterol,
## AST,
## uric.acid,
## HCO3,
## calcium,
## LDH,
## globulin,
## GGT,
## hs.CRP,
## serum.sodium,
## ALAT,
## eGFR,
## creatinine} 0.11428571 0.040000000 20
## [5] {PT,
## AT,
## prothrombin.activity,
## fibrinogen,
## TT,
## D.Dimer,
## FDP,
## INR,
## aPTT} 0.10857143 0.000000000 19
## [6] {ferritin,
## hs.CRP} 0.09714286 0.005714286 17
## [7] {hemoglobin,
## eosinophils.percent,
## basophil.percent,
## platelet.count,
## monocytes.percent,
## neutrophils.percent,
## MCV,
## hematocrit,
## WBC.count,
## MCHC,
## lymphocyte.count,
## RBC.count,
## eosinophil.count,
## neutrophils.count,
## lymphocyte,
## monocytes.count,
## basophil.count.percent,
## MCH} 0.09714286 0.000000000 17
## [8] {procalcitonin,
## glucose} 0.09142857 0.022857143 16
## [9] {serum.chloride,
## PT,
## prothrombin.activity,
## serum.potassium,
## calcium,
## INR,
## serum.sodium} 0.09142857 0.000000000 16
## [10] {ALP,
## albumin,
## total.bilirubin,
## indirect.bilirubin,
## total.protein,
## urea,
## direct.bilirubin,
## ferritin,
## total.cholesterol,
## AST,
## uric.acid,
## HCO3,
## LDH,
## globulin,
## GGT,
## ALAT,
## eGFR,
## creatinine} 0.09142857 0.000000000 16
## [11] {serum.chloride,
## ALP,
## albumin,
## total.bilirubin,
## total.protein,
## urea,
## corrected.calcium,
## serum.potassium,
## direct.bilirubin,
## ferritin,
## total.cholesterol,
## AST,
## uric.acid,
## HCO3,
## calcium,
## LDH,
## globulin,
## GGT,
## serum.sodium,
## ALAT,
## eGFR,
## creatinine} 0.09142857 0.000000000 16
## [12] {serum.chloride,
## procalcitonin,
## ALP,
## albumin,
## total.bilirubin,
## total.protein,
## urea,
## corrected.calcium,
## serum.potassium,
## direct.bilirubin,
## total.cholesterol,
## AST,
## uric.acid,
## HCO3,
## calcium,
## LDH,
## globulin,
## GGT,
## hs.CRP,
## serum.sodium,
## ALAT,
## eGFR,
## creatinine} 0.09142857 0.000000000 16
The algorithm returned 12 closed frequent itemsets. Analyzing the statistics above, we can draw some conclusions:
For now the percent of NA values in the dataset is too high.
(sum(is.na(df)) / (nrow(df) * ncol(df))) * 100
## [1] 79.9435
To minimize the number of missing values, we will collapse all rows linked to one patient with tests’ results, which were released on the same day. In case of multiple results of the same test during one day, the latest will be saved.
df_clean <- df %>%
mutate(re.date = floor_date(re.date, unit = "day")) %>%
group_by(patient.id, re.date) %>%
summarize(across(age:outcome, last), across(hs.cTnI:creatinine, ~.[last(which(!is.na(.)))])) %>%
ungroup()
Let’s remove patients who have not been tested for at least 50% of all available tests.
ids <- df_clean %>%
group_by(patient.id) %>%
summarize(across(hs.cTnI:creatinine, ~any(!is.na(.)))) %>%
summarize(patient.id = patient.id, n = rowSums(select(., c(hs.cTnI:creatinine)))) %>%
ungroup() %>%
filter(n > ncol(select(df_clean, c(hs.cTnI:creatinine))) * 0.5)
df_clean <- df_clean %>%
filter(patient.id %in% ids$patient.id)
(sum(is.na(df_clean)) / (nrow(df_clean) * ncol(df_clean))) * 100
## [1] 51.36116
As we can see, the percentage of NA values decreased from approximately 80% to 51%, which is still high, but deleting even more records would significantly reduce the size of the dataset.
Final version of the dataset contains 1698 rows and 80 columns. Each row represents a set of blood tests results, which were performed for a single patient on a given day.
dim(df_clean)
## [1] 1698 80
Detailed description of individual columns of the dataset with some basic statistics are introduced in the codebook below.
A sample of cleaned data is presented below.
This section is dedicated to attribute values analysis and visualization. It covers not only separate study of each variable but also attempts of finding correlation between them.
The following chart presents how many blood tests of a given type were performed during each day from January 10 to February 18, 2020.
At first glance, we observe a significant rise in the total number of tests after January 27, which suggests that from this day forward more people infected with COVID-19 were admitted to the hospital on a daily basis. Another fact is that before that day the variety of blood test types, which were performed, was visibly lower. Throughout the whole considered period, the most frequently carried out exams were tests included in Complete Blood Count, Blood Biochemistry and glucose level categories.To confirm the suspicion of a growing number of admitted patients starting on January 27, let’s construct a chart visualizing the number of admitted and discharged patients on a given day.
The number of patients admitted to the hospital on a single day peaked on February 10, whereas the highest number of discharged patients was observed on February 18.This section is broke into 3 parts: analyzing data concerning patient’s personal information, blood test results for each type and correlation between them.
The following overlaying histograms portray age distribution of COVID-19 survivors and victims.
The majority of patients, who died for discussed disease, were people aged 65 years or older. Only two young people in age ranges <20 and 30-34 were an exception from this trend. Reported survivors were mainly people under 70 years.Based on the assumption that potential readers of this analysis are not proficient in interpreting blood test results, calculating basic statistics for each column won’t be very helpful in understanding what they mean. To avoid this, the data in each column was grouped by the outcome of treatment. Comparing results of these two classes gives us a useful point of view in deciding, which value ranges can be considered abnormal for a specific variable.
TODO
In order to check if some of variables with blood test results values are correlated, we calculate Pearson correlation coefficient for columns [8:80] obtaining correlation matrix. Beside the value of the correlation coefficient we will take under consideration significance level as well. To do so, we calculate p-values and store them in another matrix.
The plot below presents a correlogram combined with the significance test sorted by the coefficient value. The diagonal is omitted for better readability. In the above figure, correlations with p-value > 5% are considered as insignificant. In this case the correlation coefficient values are left blank.
We observe on the correlogram the following facts:
TODO
This section describes preprocessing of the dataset and build, training and validation of the model for a classifier.
TODO
TODO
TODO
TODO